Resampling methods for document clustering 
We compare the performance of different clustering algorithms applied to the task of unsupervised 
text categorization. We consider agglomerative clustering algorithms, principal direction 
divisive partitioning and (for the first time) superparamagnetic clustering with several distance mea- 
sures. The algorithms have been applied to test databases extracted from the Reuters-21578 text 
categorization test database. We find that simple application of the different clustering algorithms 
yields clustering solutions of comparable quality. In order to achieve considerable improvements of 
the clustering results it is crucial to reduce the dictionary of words considered in the representation 
of the documents. Significant improvements of the quality of the clustering can be obtained by 
identifying discriminative words and filtering out indiscriminative words from the dictionary. We 
present two methods, each based on a resampling scheme, for selecting discriminative words in an 
unsupervised way. 

Keywords: clustering, text categorization, document classification, feature selection, random sub- 
sampling 



I. INTRODUCTION 



Automatic text categorization has many interesting 
applications in science and business, for instance formatting 
the results of a web-search query, sorting news 
messages according to topics or sorting incoming email 
with different concerns. Irrespective of the application 
two principal cases are distinguished. In the first case 
the categories are known and the algorithm should as- 
sign any document to one of the known categories. Then 
it is useful to teach the algorithm the considered cate- 
gories and their word fields by using a training set of 
labeled documents before it will be applied to unknown 
documents. The algorithmic solutions to this problem 
fall into the category of supervised learning. In the other 
case the categorization of documents has to be done without 
knowing either the categories or their number. Then the 
algorithm should find a reasonable partition of the doc- 
ument set such that documents in the same subset of 
the partition are similar and documents of different sub- 
sets are dissimilar. This task is termed unsupervised text 
categorization and can be handled with clustering algorithms [1]. 

In this work we focus on the second task only. As an 
illustrative example for an application one could think 
of the results obtained from web search engines. Usu- 
ally the query results are on several subjects and only a 
fraction of the documents is about what one is interested 
in. It will help the user if instead of an unsorted list the 
results are presented in several folders that gather web- 
documents of similar content. Further each folder could 
be characterized by a list of key words. Then one can 
investigate the mass of documents that match a query 
in a more efficient way and the indicated keywords may 
help refining the search. In the general case of web search 
results we are not supplied with any training data such 
that we can only use clustering algorithms in order to 
classify the links. Such a combination of a web search 
engine and a clustering tool has been proposed e.g. by 
Boley §. 

The aims of this work are threefold. First we want to 
compare several methods measuring their performance 
on unsupervised text categorization. Second we want to 
apply superparamagnetic clustering (SPC), a rather 
new method that has so far not been considered for text 
categorization. And third we want to present two meth- 
ods of unsupervised feature selection and estimate the 
improvements that can be achieved by their application. 



II. CLUSTERING ALGORITHMS 

Clustering of data is usually done in three steps: rep- 
resentation, calculation of similarities and application of 
a clustering algorithm. As mentioned above the aim is 
to provide a partition of a data set X that reflects the 
similarities between data points. So we should define a 
similarity measure $s : X \times X \to \mathbb{R}$, which in turn requires 
a numerical representation of the data. 

Usually the representation is done by constructing a 
vector space spanned by a set of selected features of the 
data. This means that one defines certain features and 
for all data assigns numbers according to how much the 
features apply. Then a data point is represented by a 
vector in the feature space. 

For text categorization one commonly uses the "bag 
of words" representation. To do so one compiles a 
dictionary $W = \{w_1, w_2, \ldots, w_m\}$ of all the words that 
appear at least once in at least two of the documents. 
Documents are then represented by counting the number 
of occurrences of each word in the document. One 
thus obtains an $n \times m$ matrix $F = (f_{\alpha i})$ of word frequencies, 
where $f_{\alpha i}$ is the number of times the word $w_i$ appears 
in document $x_\alpha$, and document $x_\alpha$ is represented by the 
feature (row) vector $(f_{\alpha i})_{i=1,\ldots,m}$. Typically this feature 
space is very high-dimensional and the matrix is 
very sparse; in our case the fraction of nonzero entries 
is about 2%. 
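
The construction of the word-frequency matrix can be illustrated with a minimal sketch. It assumes whitespace tokenization and lower-casing, neither of which is specified in the text, and the function name `build_count_matrix` is hypothetical.

```python
from collections import Counter


def build_count_matrix(docs):
    """Return (dictionary, F) where F[a][i] counts word i in document a.

    Assumption: documents are plain strings, tokenized on whitespace.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Coverage of a word: the number of documents it appears in.
    coverage = Counter(w for tokens in tokenized for w in set(tokens))
    # Keep only the words that appear in at least two documents.
    dictionary = sorted(w for w, c in coverage.items() if c >= 2)
    index = {w: i for i, w in enumerate(dictionary)}
    F = [[0] * len(dictionary) for _ in docs]
    for a, tokens in enumerate(tokenized):
        for w in tokens:
            if w in index:
                F[a][index[w]] += 1
    return dictionary, F


docs = ["coffee prices rise", "sugar prices fall", "coffee exports rise"]
dictionary, F = build_count_matrix(docs)
```

In this toy input the words "sugar", "fall" and "exports" occur in one document only and therefore do not enter the dictionary.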

To measure the similarity of two documents one could 
consider for instance the dot product of two correspond- 
ing normalized feature vectors. Alternatively one can of 
course use a dissimilarity measure. The choice of the 
similarity measure has to be done carefully and influ- 
ences the performance of the clustering. See e.g. || for 
a comparison of some similarity measures used for text 
categorization. 

For the text categorization we found the $l_1$- and 
$l_2$-norms useful, as well as other dissimilarity measures (see below). 
Generally, in order to avoid skewness of the data 
due to the different length of the documents, it is help- 
ful to normalize the data such that the length of a row 
vector is one. 
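
A short sketch of the row normalization and of the dot-product similarity mentioned above may help. The function names are hypothetical, and returning an all-zero row unchanged is an assumption made only to avoid a division by zero.

```python
import math


def normalize(row, p=2):
    """Scale a feature row to unit l_p norm (p = 1 or 2).

    Assumption: an all-zero row is returned unchanged.
    """
    if p == 1:
        norm = sum(abs(v) for v in row)
    else:
        norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else list(row)


def cosine_similarity(x, y):
    """Dot product of the two l_2-normalized feature vectors."""
    xn, yn = normalize(x, 2), normalize(y, 2)
    return sum(a * b for a, b in zip(xn, yn))


short_doc = [1, 1, 0]    # short document
long_doc = [10, 10, 0]   # same word profile, ten times longer
other_doc = [0, 0, 3]    # shares no word with the other two
same_profile = cosine_similarity(short_doc, long_doc)
no_overlap = cosine_similarity(short_doc, other_doc)
```

After normalization the short and the long document with the same word profile become identical up to rounding, which is the effect the normalization is meant to achieve.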

Finally, given the similarity measure $s$, the task of a 
clustering algorithm is to compute a clustering solution, 
i.e. a partition of the set of data points $X$ into subsets 
(clusters) $\{C_1, C_2, \ldots, C_n\}$ such that $s(x_\alpha, x_\beta)$ is large 
when $x_\alpha$ and $x_\beta$ are in the same cluster and $s(x_\alpha, x_\beta)$ is 
small when $x_\alpha$ and $x_\beta$ are in different clusters. 

This is a rather fuzzy description of the aim of a clus- 
tering algorithm and we are not going to refine that point 
here. A good clustering solution can be found at differ- 
ent resolutions, so that a proper optimization problem 
can not easily be formulated. Consider for example the 
biological classification of the animals. There we find 
phyla that are further subdivided into classes, orders, 
etc. and each level of partitioning has some justification. 
If we consider a clustering of the animals cat, dog, jellyfish, 
mouse and snake, we find dog and cat in the same 
cluster but well separated from the jellyfish when we 
take a look from a large distance and consider a coarse 
classification. A finer classification, however, will separate 
dog and cat into different clusters. 

For many data sets this resolution is an arbitrary pa- 
rameter which has to be determined in accordance with 
the desired classification task. We therefore do not con- 
sider a single partition of the data set as a clustering so- 
lution but rather a hierarchy of partitions with increasing 
resolution that can be represented in a tree. 

Agglomerative clustering methods j| successively 
merge two clusters until finally all data points are united. 
By doing so these methods implicitly provide such a tree. 
SPC and $k$-means provide single partitions which depend 
on a resolution parameter that has to be specified. Run- 
ning these algorithms with different values of the resolu- 
tion parameter yields several partitions that can be trans- 
formed into a tree. Generally the partitions obtained at 
the next higher resolution are not proper subpartitions. 
In order to fix this one usually considers the intersections 
of high resolution clusters and low resolution clusters. 

Each triplet of a representation, a distance measure 
and a clustering algorithm is considered as a clustering 
method. In section IV we specify in detail the methods 
we apply to the test data. All methods we compare yield 
a hierarchical clustering tree. 



For practical purposes it is then often important to re- 
duce the amount of information in the tree, and present 
only some selected clusters as the essence of the cluster- 
ing. Then one can apply a search algorithm that selects 
the "most meaningful" clusters in the tree for presenta- 
tion as the clustering result. 

Applying an algorithm that searches a tree for good 
clusters can be considered the fourth step of clustering 
and can be done in various ways. When one is using SPC 
one can look at the change of the susceptibility versus the 
temperature and from that function one can determine 
the "best" resolution || . But the search is not restricted 
to finding an optimal single resolution. For the text cat- 
egorization task we found that the natural classes of the 
documents as classified by human readers, and so spec- 
ified by the labels, are best approximated at different 
levels of resolution. Therefore we prefer other methods 
that individually judge a single cluster as good or bad 
and therefore allow picking clusters from different levels 
of resolution. This can be done by measuring the stabil- 
ity of the cluster with respect to the resolution parameter 
[7] or with respect to thinning out the dataset by considering 
subsamples. 

However, the unsupervised identification of good clus- 
ters is a complex issue that we do not discuss in this 
paper, cf. section IV C. 



III. FEATURE SELECTION: FINDING 
DISCRIMINATIVE WORDS 



A crucial point within the representation of the docu- 
ments that bears some potential to improve the cluster- 
ing results is the selection of words from the dictionary. 
Everybody will immediately agree that the words "and" , 
"or", "while" and "with" are useless for document cat- 
egorization. These words are not characteristic for the 
content of the document and spoil the signal to noise 
ratio in the representation. 

Usually words like prepositions, conjunctions, etc. are 
read from a stoplist that contains about 400 known stop- 
words. They are taken out of the dictionary, i.e. out of 
the feature set and, as we will show in our results, this 
rejection of unwanted features yields some improvement 
on the results. 

So far this is not new, and this procedure 
does only part of the job. A more difficult problem is 
to get rid of the many noise words that are not on the 
stoplist. 

After the application of the stoplist we are left with 
5036 and 9019 words for our first and second test 
database, respectively. On the other hand, in the DSR experiments 
described below we find improved clustering 
results on the basis of 350 automatically selected 
words. So more than 90% of the words that are not on 
the stoplist are not needed. More than that: they make 
the task harder because they add noise to the document 
representation. 

Finding those good words is not easy. Whether a word 
is a useful feature for document classification depends 
on the categories that appear in the document set and 
can not be told by looking at the mere word. The word 
"Italy" for instance might be discriminative in some set 
of documents where one of the categories is tourism, but 
it can be a pure noise feature in another set of documents. 

A widely used method is the application of lower and 
upper thresholds for the coverage of a word, i.e. the num- 
ber of documents a word appears in. This can improve 
the results but the quality of the clustering depends sen- 
sitively on these thresholds and one can not tell in gen- 
eral which values of these thresholds are the best. An 
alternative way to select relevant features, that has been 
proven useful for the analysis of microarray data, is the 
application of clustering algorithms to the feature set. 
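
The coverage thresholds described above can be sketched as follows. The function name `filter_by_coverage` and the threshold values in the example are hypothetical; the text states only that good values cannot be given in general.

```python
def filter_by_coverage(F, dictionary, min_cov, max_cov):
    """Keep the words whose coverage lies in [min_cov, max_cov].

    F is a list of rows (documents); F[a][i] is the count of word i
    in document a. Returns the reduced dictionary and matrix.
    """
    n_words = len(dictionary)
    # Coverage: number of documents in which the word occurs at all.
    coverage = [sum(1 for row in F if row[i] > 0) for i in range(n_words)]
    keep = [i for i in range(n_words) if min_cov <= coverage[i] <= max_cov]
    kept_words = [dictionary[i] for i in keep]
    reduced = [[row[i] for i in keep] for row in F]
    return kept_words, reduced


F = [[1, 1, 0], [1, 0, 2], [1, 3, 0], [1, 0, 0]]
kept_words, reduced = filter_by_coverage(F, ["the", "oil", "ship"], 2, 3)
```

Here "the" is dropped because it occurs in all four documents and "ship" because it occurs in only one, so only "oil" survives.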

In order to identify good words and throw away the bad 
words in an unsupervised way we developed two strate- 
gies that are both based on resampling. 



A. Word set resampling (WSR) 

Experiments with different thresholds for minimal and 
maximal coverage have shown that choosing a different 
word subset for clustering the same database can strongly 
affect the global clustering tree structure. Such sensi- 
tivity is probably due to "shot noise" arising from the 
finite size of the dictionary (and an even smaller num- 
ber of words that appear in a single category or a single 
document). One way to eliminate such noise is taking 
different, e.g. randomly chosen, word subsets from the 
dictionary, clustering the documents on the basis of these 
subsets and then averaging the result somehow. This is 
the idea of the word set resampling algorithm described 
in the following. 

Let us first explain the procedure that is applied for 
each word subset ("probe clustering"). We cluster the 
documents represented only through the words in the 
subset by applying an agglomerative clustering algorithm 
(see below). Each agglomeration process is continued until 
the stop criterion (1) holds, where $C_1$ and $C_2$ are 
the two largest clusters and $|C_2| \le |C_1|$: 

$$|C_1| + |C_2| > \frac{1}{2} |X| \qquad (1)$$

The clustering is then considered as "good" if the following 
quality condition is not violated: 

$$|X| - |C_1| - |C_2| < |C_2| \qquad (2)$$

One can see that if both conditions (1) and (2) hold, then 
$C_1$, $C_2$ are of comparable size and contain the majority 
of the documents (there is no other aim of (1) and (2)). 
If the probe clustering is not "good" then it is rejected. 
For "good" clusterings we calculate for every word the 
entropy with respect to the two biggest clusters: 

$$H_i = -\left(p_{i1} \log p_{i1} + p_{i2} \log p_{i2}\right) \qquad (3)$$

where 

$$p_{ij} = \frac{\left|\{x_\alpha \mid f_{\alpha i} > 0\} \cap C_j\right|}{\left|\{x_\alpha \mid f_{\alpha i} > 0\} \cap (C_1 \cup C_2)\right|}, \qquad j = 1, 2. \qquad (4)$$

Further, at that stage we check for each pair of documents 
whether they are in the same cluster: 

$$M_{\alpha\beta} = \begin{cases} 1, & \text{if } x_\alpha, x_\beta \text{ are in the same cluster;} \\ 0, & \text{otherwise.} \end{cases} \qquad (5)$$

We repeat this probe clustering until we have $N_R$ "good" 
ones. The word subsets are obtained by throwing out one 
randomly drawn word from each document. In order to 
avoid empty documents, in some cases some of the thrown-out 
words have to be replaced. 

After doing all subsample clusterings we average $H_i$ 
and $M_{\alpha\beta}$ over all $N_R$ "good" probe clusterings. Intuitively 
we consider $\langle H_i \rangle$ as the quality of the word $w_i$, 
and $\langle M_{\alpha\beta} \rangle$ as the similarity of the documents $x_\alpha$ and $x_\beta$ 
($\langle \ldots \rangle$ denotes the average). We throw away all the words 
whose average entropy exceeds the threshold: 

$$\langle H_i \rangle > \theta \log 2, \qquad (6)$$

where $\theta \in [0, 1]$. 

Next, we find the maximal value $M = \max_{\alpha \neq \beta} \langle M_{\alpha\beta} \rangle$ 
and merge together all the documents $x_\alpha$, $x_\beta$ with 
$\langle M_{\alpha\beta} \rangle = M$. 

In the next step we remove all the words that appear 
solely in documents of one cluster. 

Unless we are left with one huge cluster of all the doc- 
uments we will repeat this procedure. In doing so we 
consider as single documents in the next step those that 
were merged in the current step. 

The parameters of this algorithm are $N_R$ and $\theta$. 
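
The per-probe quantities of WSR, i.e. the word entropy with respect to the two biggest clusters and the pairwise co-membership indicator, can be sketched as below. All function names are hypothetical. Returning `None` for a word that occurs in neither of the two clusters is an assumption, since the text does not say how that case is handled.

```python
import math


def word_entropy(F, i, c1, c2):
    """Entropy of word i with respect to the two biggest clusters.

    c1 and c2 are collections of document indices; only the presence
    of the word in a document matters, not its count.
    """
    n1 = sum(1 for a in c1 if F[a][i] > 0)
    n2 = sum(1 for a in c2 if F[a][i] > 0)
    total = n1 + n2
    if total == 0:
        return None  # assumption: undefined when the word is absent
    h = 0.0
    for n in (n1, n2):
        p = n / total
        if p > 0:
            h -= p * math.log(p)
    return h


def comembership(labels):
    """Matrix with entry 1 where two documents share a cluster label."""
    n = len(labels)
    return [[1 if labels[a] == labels[b] else 0 for b in range(n)]
            for a in range(n)]


def discard_word(mean_entropy, theta):
    """True if the averaged entropy exceeds the threshold theta * log 2."""
    return mean_entropy > theta * math.log(2)


F = [[2, 1], [1, 0], [0, 1], [0, 3]]
c1, c2 = {0, 1}, {2, 3}
h_word0 = word_entropy(F, 0, c1, c2)  # word occurs in c1 only
h_word1 = word_entropy(F, 1, c1, c2)  # word is spread over both clusters
```

A word confined to one of the two clusters has zero entropy and is kept; a word spread evenly over both approaches the maximum $\log 2$ and is discarded for any $\theta < 1$.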



B. Document set resampling (DSR) 

This method is motivated by the good results obtained 
with WSR. It is meant to be an alternative that does not 
require such a high computing time as WSR. Whereas 
WSR calculates the entropy of the words only at a single 
stage of the subsample clustering, DSR tries to get hints 
for the good words by looking at the whole clustering of 
the subsamples. 

DSR is based on two assumptions that have been found 
to be true in our experiments: first we found that the re- 
ally useful words have a considerable coverage, i. e. these 
words appear in many documents. Second we assume 
that in agglomerative clustering, when we consider the 
number of mergings that have been done as a monotoni- 
cally decreasing measure of the clustering resolution, the 
cluster entropy of the good words often decreases earlier 
with decreasing resolution than the cluster entropy of the 
bad words. The detailed description of the algorithm is 
as follows: 

We create $N_R$ subsamples $X^{(k)} \subset X$, $k = 1, \ldots, N_R$, 
each consisting of $n_{\mathrm{sub}}$ randomly drawn documents. 

To each subset $X^{(k)}$ we apply an agglomerative clustering 
algorithm. We number the successive mergings of 
the agglomeration and refer to the intermediate 
clustering solution at step $r$. By $r = 1$ we refer to the 
initialization with each single document being in a separate 
cluster, and $r = n_{\mathrm{sub}}$ corresponds to the final step 
where all documents are in a single cluster. 

We then consider the words that appear in at least 
$n_{\min}$ documents of the current subsample and for each of 
these words we calculate the development of its cluster 
entropy in the early steps of the agglomeration process, 
i.e. $r = 1, 2, \ldots, 0.7\, n_{\mathrm{sub}}$. 

Thus at step $r$ we find the clusters $C_l$ with the labels 
$l \in L_r = \{1, \ldots, n_{\mathrm{sub}} - r + 1\}$ and we calculate 

$$H_i^{(k)}(r) = -\sum_{l \in L_r} p^{(k)}(l \mid w_i) \log p^{(k)}(l \mid w_i) \qquad (7)$$

where 

$$p^{(k)}(l \mid w_i) = \sum_{\alpha \in C_l} f_{\alpha i} \Big/ \sum_{\alpha \in X^{(k)}} f_{\alpha i} \qquad (8)$$
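
The cluster entropy of a word at a given agglomeration step can be sketched as follows. The function name is hypothetical, and returning zero for a word that does not occur in the subsample is an assumption.

```python
import math


def cluster_entropy(F, i, clusters):
    """Entropy of the occurrences of word i over the current clusters.

    F[a][i] is the count of word i in document a of the subsample;
    clusters is a list of lists of document indices that together
    partition the subsample.
    """
    total = sum(F[a][i] for cluster in clusters for a in cluster)
    if total == 0:
        return 0.0  # assumption: word absent from the subsample
    h = 0.0
    for cluster in clusters:
        p = sum(F[a][i] for a in cluster) / total
        if p > 0:
            h -= p * math.log(p)
    return h


F = [[1], [1], [2]]  # one word, three documents
h_start = cluster_entropy(F, 0, [[0], [1], [2]])   # every document alone
h_merged = cluster_entropy(F, 0, [[0, 1], [2]])    # after one merging
h_final = cluster_entropy(F, 0, [[0, 1, 2]])       # all documents merged
```

In the example the entropy falls with each merging and is zero once all documents share a single cluster.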



In order to highlight the effect of successive mergings 
on the cluster entropy we normalize the cluster entropies 
with respect to the initial cluster entropies at step $r = 1$: 

$$\tilde{H}_i^{(k)}(r) = \frac{H_i^{(k)}(r)}{H_i^{(k)}(1)} \qquad (9)$$

Now $\tilde{H}_i^{(k)}(r)$ decreases as the resolution becomes lower 
with successive mergings, i.e. as $r$ increases, and finally, 
when all documents have been merged into the same cluster, 
the entropy is zero for all words. 

We found that if we consider only the words whose entropy 
decreases early in the agglomeration process 
we find a higher fraction of good words that have 
more value for the document clustering. However, the 
entropy of words that appear only in two or three documents 
naturally decreases to zero within two or three 
mergings and thus (again) produces some sort of shot 
noise that spoils the statistics. Thus we consider only the 
words that appear in at least $n_{\min}$ documents of the subsample 
and we keep track of the number of those words 
that have low entropy, i.e. we count 

$$q^{(k)}(r) = \left|\{w_i \mid \tilde{H}_i^{(k)}(r) < \theta\}\right| \qquad (10)$$



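At first the number of low-entropy words $q^{(k)}(r)$ increases 
slowly and later increases in larger steps. We consider the 
first increment 

$$\Delta^{(k)}(r) = q^{(k)}(r + 1) - q^{(k)}(r) \qquad (11)$$

that is significantly higher than the average increment as 
a cutoff criterion at which we decide to keep the words 
that have low entropy according to (10) at that stage. 
That is, we look for the minimal value $r_1^{(k)}$ fulfilling 

$$\Delta^{(k)}(r_1^{(k)}) > \langle \Delta^{(k)} \rangle + \sqrt{\left\langle \left(\Delta^{(k)} - \langle \Delta^{(k)} \rangle\right)^2 \right\rangle} \qquad (12)$$

and we select as good words those whose normalized entropy (9) 
lies below the threshold at that step: 

$$W_{\mathrm{good}}^{(k)} = \{w_i \mid \tilde{H}_i^{(k)}(r_1^{(k)}) < \theta\}. \qquad (13)$$

Finally we consider the union 

$$W_{\mathrm{good}} = \bigcup_{k = 1, \ldots, N_R} W_{\mathrm{good}}^{(k)} \qquad (14)$$

as our selected dictionary and we cluster all documents 
in $X$ using the words (features) in $W_{\mathrm{good}}$. 

The parameters of this algorithm are $N_R$, $n_{\mathrm{sub}}$, $n_{\min}$ 
and $\theta$. 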
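
The cutoff rule of DSR, i.e. locating the first increment of the low-entropy word count that exceeds the mean increment by more than one standard deviation, can be sketched as follows. The function name is hypothetical, indices are 0-based rather than starting at $r = 1$, and taking the mean and standard deviation over all recorded increments is an assumption.

```python
import math


def first_large_increment(q):
    """Return the first index r with q[r + 1] - q[r] > mean + std.

    q[r] is the number of low-entropy words at agglomeration step r
    (0-based here). Returns None if no increment exceeds the bound.
    """
    deltas = [q[r + 1] - q[r] for r in range(len(q) - 1)]
    if not deltas:
        return None
    mean = sum(deltas) / len(deltas)
    std = math.sqrt(sum((d - mean) ** 2 for d in deltas) / len(deltas))
    for r, d in enumerate(deltas):
        if d > mean + std:
            return r
    return None


# Count of low-entropy words along one (made-up) agglomeration run.
q = [0, 0, 1, 1, 2, 9, 12]
cutoff = first_large_increment(q)
```

For this made-up sequence the increments are 0, 1, 0, 1, 7, 3 with mean 2 and standard deviation of about 2.45, so the jump of 7 is the first one above the bound and the cutoff index is 4.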



IV. COMPARISON OF THE CLUSTERING 
METHODS 

A. The test dataset and the different experiments 

In order to compare the performance of the differ- 
ent clustering algorithms we extracted two test data sets 
from the known Reuters-21578 test database for text cat- 
egorization ||. This database contains 21578 Reuters 
news messages that were manually labeled as belonging 
to certain categories. We use the labels not for the clus- 
tering procedure but in order to evaluate the quality of 
the clustering results as will be described in the next sec- 
tion. 

For each of our two test datasets we took all documents 
of eight selected categories. In order to keep things simple 
we considered some preprocessing of the labels as helpful. 
Those few documents that have been labeled as belonging 
to more than one category were assigned unambiguously 
to the first label given in the database. One should keep 
in mind that the labels were given by human readers and 
are therefore subject to individual perception. 

The resulting test databases, i.e. the categories that 
are to be separated from one another and the number 
of documents extracted from the Reuters database, are 
listed in table I. Please note that some categories appear 
in both tasks. This is meant to shed some light 
on whether the good or bad separability of a category 
is an individual property of that category and its word 
field or whether it is rather a question of interference with another 
category using the same words. 

We found that the quality of clustering results is sen- 
sitive to the input dataset, particularly to the composi- 
tion of the news categories that are used. For instance, 
in clustering experiments done with the first database all 
clustering algorithms separate documents of the category 
"coffee" much better than they separate documents of the 
category "oilseed". We thus concluded that the quality 
of the clustering results depends on the categories and 
the distribution of documents rather than on the indi- 
vidual documents. Obviously clustering documents of a 
certain category is easier if there is a set of discriminative 
words that are used in most documents of that category 
and only in those documents. 

Nevertheless the quality of the clustering results also varies 
when the same algorithm is applied to different randomly 
chosen subsets of the database with specified numbers 
of documents from each category. In order to estimate 
the performance of the clustering algorithms on the 
clustering task we therefore create 50 different subsets 
of the test database, each of which contains a specified 
number of documents from each category. The clustering 
accuracy is then averaged over the 50 individual 
realizations of each experiment. 

In particular we constructed four experiments for each 
database. In three of them we chose 50 different subsets of 200, 500 and 
800 documents, preserving the ratios of the numbers of 
documents in the categories; in the fourth experiment (EQ) 
the number of documents was the same in each category, 
see table II. 



category       coffee   cpi   gnp   money-supply   oilseed   ship   sugar   veg-oil
no. of docs    124      75    117   113            78        204    145     93

category       trade   crude   grain   money-supply   interest   ship   sugar   money-fx
no. of docs    441     483     489     113            263        204    145     574



TABLE I. The composition of the two test databases. 



experiment   coffee   cpi   gnp   money-supply   oilseed   ship   sugar   veg-oil
200          26       16    25    24             16        43     30      20
500          65       40    62    60             41        107    76      49
800          105      63    99    95             66        172    122     78
EQ           64       64    64    64             64        64     64      64

experiment   trade   crude   grain   money-supply   interest   ship   sugar   money-fx
200          33      36      36      8              19         15     11      42
500          81      89      90      21             48         38     27      106
800          130     143     144     33             78         60     43      169
EQ           100     100     100     100            100        100    100     100



TABLE II. The composition of the test datasets in the experiments. 



B. The clustering methods 

The representation of the documents was unchanged 
for all clustering methods. We use the bag of words rep- 
resentation described in the first section and applied a 
stoplist throwing out common words like "and" , "then" 
or "but" . After application of the stoplist the number of 
words is 5036 for the first database and 9019 for the sec- 
ond. In order to see the improvement achieved by the ap- 
plication of the stoplist we also clustered the documents 
using the "noisy" representation of the full dictionary. 
These cases are indicated by the letter "R" in tables III 
and IV. 



We applied the following clustering algorithms: 

ARG is an agglomerative method that is inspired by 
some ideas from renormalization group theory; a similar 
procedure can be found in [10]. It uses the dot product 
of two $l_2$-normalized feature vectors as similarity measure. 
The $n \times n$ matrix of pairwise similarities of the 
data points is calculated in the beginning. Every single 
data point is considered a cluster. The algorithm 
then successively joins the two clusters which have the highest 
pairwise similarity. The similarity of the new cluster to 
another cluster is calculated from the similarities of the 
joined clusters as follows: 

$$s(C_{\mathrm{new}}, C_i) = \sqrt{0.5 \left(s^2(C_{\mathrm{old1}}, C_i) + s^2(C_{\mathrm{old2}}, C_i)\right)}. \qquad (15)$$
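
The agglomeration with the similarity update rule above can be sketched as follows. This is a minimal illustration under the assumption that the full similarity matrix fits in memory; the function names and the tie-breaking (the first maximal pair found) are hypothetical choices, not details taken from the text.

```python
import math


def merged_similarity(s1, s2):
    """Similarity of a merged cluster to a third cluster."""
    return math.sqrt(0.5 * (s1 ** 2 + s2 ** 2))


def arg_merge_order(S):
    """Agglomerate using the symmetric similarity matrix S.

    Returns the list of merged clusters (tuples of point indices)
    in the order in which they were formed.
    """
    n = len(S)
    members = {i: (i,) for i in range(n)}
    sim = {frozenset((i, j)): S[i][j]
           for i in range(n) for j in range(i + 1, n)}
    merges = []
    new_id = n
    while len(members) > 1:
        # Join the two clusters with the highest pairwise similarity.
        a, b = sorted(max(sim, key=sim.get))
        updated = {}
        for c in members:
            if c not in (a, b):
                updated[c] = merged_similarity(sim[frozenset((a, c))],
                                               sim[frozenset((b, c))])
        # Drop all similarities that involve the two joined clusters.
        sim = {k: v for k, v in sim.items() if a not in k and b not in k}
        for c, s in updated.items():
            sim[frozenset((new_id, c))] = s
        members[new_id] = members.pop(a) + members.pop(b)
        merges.append(members[new_id])
        new_id += 1
    return merges


S = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.3, 0.1],
     [0.1, 0.3, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
merges = arg_merge_order(S)
```

The sequence of merges defines the hierarchical clustering tree: in the example, points 0 and 1 are joined first, then points 2 and 3, and finally the two pairs.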

AIB is an implementation of the agglomerative information 
bottleneck algorithm [11, 12]. Here the feature 
vectors are normalized with respect to the $l_1$-norm and 
interpreted as discrete probability distributions. 
Each entry in the feature vector gives the 
probability of getting the corresponding word when one 
word is drawn randomly from the text. The motivation 
of the information bottleneck principle is to successively 
join document clusters such that the loss of mutual 
information between the cluster assignment of the documents 
and the occurring words is minimal. This algorithm 
has been applied to unsupervised text categorization. 

SPC is inspired by a model in theoretical physics. 
From the data a Potts spin model of an inhomogeneous 
ferromagnet is constructed that inherits the structure of 
the data. The increasing fragmentation of domains of 
parallel Potts spins when the model magnet is simulated 
at increasingly higher temperatures yields a hierarchical 
cluster structure of the data. We have applied this 
method using three different dissimilarity measures on 
the data, i.e. the $l_1$- and $l_2$-distance measures and the 
Jensen-Shannon divergence (JSD). Normalization of the 
feature vectors has been done according to the corresponding 
distance measure, i.e. $l_1$ and $l_2$, respectively, and $l_1$ 
for the JSD. Of these three alternatives we obtained the 
best results using the JSD. The results depend, though not 
very sensitively, on the parameter $k$ of the SPC algorithm, 
which determines the number of bonds in the 
model magnet. We found that the optimal value of 
$k$ depends on the size of the dataset. We used $k = 10$ 
for $n = 200$, $k = 15$ for $n = 500$ and $k = 20$ for $n = 800$ 
to get the best results. The general dependence of the 
performance on the value of $k$ is complex and remains for 
further investigation. 
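
For reference, the Jensen-Shannon divergence between two $l_1$-normalized feature vectors can be computed as in the following sketch. Equal weights of 1/2 for the two distributions and the natural logarithm are assumptions; the text does not state which variant was used.

```python
import math


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions.

    Assumptions: p and q are l_1-normalized, equally weighted, and
    the natural logarithm is used.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i = 0 contribute 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


identical = jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike the Kullback-Leibler divergence, this measure stays finite when the two documents share no words, which matters for sparse word-frequency vectors; with these conventions its maximum is $\log 2$.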

PDDP has been proposed for text categorization by 
Boley [14]. It is a very fast method that scales linearly 
with the number of documents. Here no distance mea- 
sure is calculated but the document set is recursively split 
into two pieces. The two subsets are separated by a hy- 
perplane that passes through the mean and is perpen- 
dicular to the direction of maximal variance of the data. 
The splitting of the clusters is iterated until a prescribed 
number of clusters, here we chose 64, has been reached. 
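
A single PDDP split can be sketched as below. The direction of maximal variance is approximated by power iteration on the centered data, which is a simplification chosen to keep the example free of external libraries and not necessarily how the method is implemented in practice. The function name, the starting vector and the iteration count are hypothetical, and the rule for choosing which cluster to split next is not covered.

```python
import math


def pddp_split(X, iterations=200):
    """Split the rows of X by a hyperplane through the mean that is
    perpendicular to the (approximate) direction of maximal variance.

    Returns two lists of row indices. Caveat: power iteration can fail
    if the starting vector is orthogonal to the principal direction.
    """
    n, m = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in X]
    v = [1.0 + j for j in range(m)]  # arbitrary non-symmetric start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        # One step v <- C v, with C the (unnormalized) covariance matrix.
        proj = [sum(r[j] * v[j] for j in range(m)) for r in centered]
        w = [sum(centered[a][j] * proj[a] for a in range(n))
             for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    proj = [sum(r[j] * v[j] for j in range(m)) for r in centered]
    left = [a for a in range(n) if proj[a] <= 0]
    right = [a for a in range(n) if proj[a] > 0]
    return left, right


X = [[1.0, 0.0], [2.0, 0.1], [9.0, 0.0], [10.0, 0.1]]
left, right = pddp_split(X)
```

In the example the variance is dominated by the first coordinate, so the split separates the two points with small first coordinate from the two with large first coordinate.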



category     coffee   cpi      gnp      m-sup    oilseed  ship     sugar    veg-oil  mean
WSR64 800    0.931    0.880    0.767    0.830    0.508    0.906    0.730    0.616    0.771
WSR64 500    0.926    0.812    0.770    0.777    0.531    0.890    0.839    0.617    0.770
WSR64 EQ     0.924    0.828    0.764    0.784    0.515    0.885    0.815    0.599    0.764
DSR32 200    0.878    0.829    0.844    0.889    0.475    0.769    0.672    0.580    0.742
DSR32 800    0.888    0.842    0.873    0.901    0.463    0.839    0.620    0.510    0.742
WSR64 200    0.912    0.766    0.769    0.702    0.476    0.847    0.807    0.607    0.736
DSR32 500    0.868    0.810    0.872    0.905    0.469    0.805    0.645    0.529    0.738
DSR32 EQ     0.866    0.834    0.838    0.901    0.472    0.729    0.667    0.512    0.728
ARG 800      0.914    0.819    0.722    0.718    0.453    0.622    0.752    0.520    0.690
AIB 800      0.825    0.829    0.834    0.857    0.435    0.730    0.522    0.491    0.690
ARG 200      0.918    0.776    0.691    0.694    0.451    0.632    0.771    0.546    0.685
ARG 500      0.916    0.786    0.709    0.661    0.468    0.613    0.769    0.538    0.682
ARG EQ       0.901    0.817    0.665    0.697    0.503    0.606    0.738    0.528    0.682
AIB 500      0.818    0.799    0.815    0.812    0.441    0.750    0.523    0.497    0.682
AIB 200      0.791    0.811    0.793    0.783    0.443    0.718    0.562    0.521    0.678
SPC10 200    0.864    0.860    0.794    0.810    0.423    0.517    0.608    0.539    0.677
AIB EQ       0.803    0.827    0.804    0.852    0.445    0.656    0.516    0.494    0.675
SPC20 800    0.877    0.847    0.863    0.802    0.476    0.453    0.595    0.486    0.675
SPC15 EQ     0.862    0.835    0.846    0.841    0.471    0.502    0.529    0.492    0.672
SPC15 500    0.853    0.847    0.831    0.789    0.454    0.473    0.606    0.477    0.666
PDDP 800     0.878    0.730    0.601    0.617    0.407    0.720    0.758    0.477    0.649
PDDP 200     0.850    0.757    0.609    0.638    0.418    0.659    0.717    0.488    0.642
PDDP 500     0.873    0.721    0.594    0.612    0.390    0.695    0.736    0.476    0.637
R AIB EQ     0.759    0.789    0.753    0.746    0.388    0.653    0.473    0.476    0.630
PDDP EQ      0.782    0.709    0.587    0.504    0.415    0.536    0.632    0.463    0.579
RAND 200     0.227    0.230    0.233    0.232    0.243    0.342    0.246    0.227    0.247
RAND EQ      0.205    0.201    0.200    0.207    0.200    0.201    0.197    0.186    0.200
RAND 500     0.168    0.146    0.166    0.170    0.142    0.340    0.198    0.142    0.184
RAND 800     0.154    0.113    0.150    0.127    0.115    0.335    0.171    0.110    0.159



TABLE III. Maximal performance on the first database. The best value for each category 
is printed in bold letters. 



WSR has been described in the previous section. For 
the "probe clustering" we applied the ARG algorithm as 
described above. Further we chose $\theta = 0.8$. The optimal 
number of subsamples $N_R$ has been determined in a supervised 
way (see figure 1); the performance saturates at 
$N_R = 64$. 

DSR has been used as described above. For clustering 
the subsamples as well as for the final round of clustering 
the whole dataset with the reduced word set we employed 
the AIB algorithm. We found $n_{\mathrm{sub}} = 100$ and $n_{\min} = 5$ 
suitable for all experiments. Also we used $\theta = 0.8$ and 
again the number of subsamples $N_R$ has been determined 
experimentally. We found good performance at about 32 
subsamples (see figure 1). However, further increasing 
the number of subsamples spoils the selection of "good" 
words and leads to a decrease of the performance. In the 
limit $N_R \to \infty$ the word selection then degenerates to a 
cutoff criterion prescribing a minimal coverage. 

The number of "good" words that have been selected 
by the DSR method, when applied to the 800-document 
experiment with $N_R = 32$, is 328 ± 29 for the first 
database and 365 ± 33 for the second database. 

RAND has been included to provide a baseline for 
the evaluation scheme. We produced trees by randomly 
agglomerating document clusters. 

WSR64 EQ    0.627  0.845  0.701  0.664  0.513  0.698  0.889  0.453  0.674
WSR64 200   0.664  0.861  0.742  0.662  0.514  0.541  0.719  0.652  0.669
DSR32 EQ    0.665  0.829  0.606  0.706  0.593  0.676  0.708  0.527  0.664
DSR32 500   0.676  0.847  0.725  0.633  0.554  0.546  0.600  0.623  0.650
DSR32 800   0.680  0.850  0.731  0.618  0.553  0.583  0.543  0.639  0.650
DSR32 200   0.606  0.818  0.701  0.668  0.569  0.523  0.642  0.607  0.642
SPC90 EQ    0.697  0.802  0.473  0.771  0.656  0.445  0.629  0.478  0.619
AIB EQ      0.643  0.748  0.504  0.704  0.582  0.632  0.624  0.471  0.614
AIB 800     0.668  0.778  0.673  0.624  0.525  0.562  0.453  0.578  0.608
AIB 500     0.661  0.770  0.668  0.642  0.548  0.521  0.472  0.570  0.606
ARG 200     0.637  0.769  0.576  0.591  0.602  0.482  0.632  0.538  0.603
AIB 200     0.620  0.734  0.639  0.647  0.558  0.501  0.538  0.568  0.600
ARG 500     0.631  0.778  0.596  0.520  0.566  0.466  0.652  0.572  0.597
ARG EQ      0.609  0.752  0.485  0.610  0.611  0.552  0.704  0.436  0.595
ARG 800     0.631  0.774  0.573  0.494  0.584  0.478  0.631  0.576  0.593
SPC20 800   0.638  0.699  0.667  0.588  0.639  0.442  0.342  0.540  0.570
SPC15 500   0.668  0.730  0.669  0.576  0.604  0.435  0.423  0.559  0.583
SPC10 200   0.644  0.671  0.647  0.611  0.569  0.455  0.516  0.548  0.583
R AIB EQ    0.632  0.673  0.418  0.705  0.544  0.596  0.591  0.442  0.575
PDDP EQ     0.642  0.522  0.403  0.571  0.557  0.473  0.659  0.409  0.530
PDDP 200    0.644  0.587  0.470  0.521  0.536  0.424  0.502  0.452  0.517
PDDP 500    0.689  0.568  0.471  0.456  0.551  0.363  0.456  0.431  0.498
PDDP 800    0.668  0.543  0.457  0.419  0.549  0.345  0.431  0.396  0.476
RAND 200    0.253  0.285  0.284  0.271  0.227  0.225  0.242  0.331  0.265
RAND 500    0.210  0.245  0.243  0.158  0.138  0.138  0.152  0.332  0.202
RAND EQ     0.198  0.191  0.193  0.189  0.178  0.187  0.182  0.174  0.187
RAND 800    0.180  0.226  0.227  0.118  0.103  0.099  0.106  0.329  0.173


TABLE IV. Maximal performance on the second database. The best value for each cate- 
gory is printed in bold letters. 



C. Evaluation of clustering results 

We now check whether the obtained clustering solutions
provide good estimates of the categories as given
by the labels. Ideally, a cluster contains all the documents
of one category and only these. In reality, a cluster
contains documents with different category labels. To
measure how close a cluster comes to this ideal case we
define two numbers, which are calculated for each cluster.
The category to which the labels assign the most documents
of a cluster is called the type of that cluster. As purity
we define the fraction of documents of that type within
the cluster, and as efficiency we define the number of
documents in the cluster whose label is the cluster type,
divided by the overall number of documents with that label.

Let $C_1, C_2, \ldots, C_l \subset X$ be the ideal clusters with respect
to the labels, i.e. $C_1$ contains all the documents that are
labeled as belonging to category one, and let further $C$ be
a cluster found by the algorithm. Then the purity of
$C$ is defined as $P(C) = \max_i |C \cap C_i| / |C|$ and the type of
the cluster $T(C)$ is the index $i$ for which the expression
on the right side is maximal. The efficiency accounts for
the fraction of all documents of the category which are
gathered in the cluster: $E(C) = |C \cap C_{T(C)}| / |C_{T(C)}|$.

In order to have one quality measure that combines
these two aspects we use the common $F_1$ measure,
which was introduced by van Rijsbergen [15] and is
defined as

$$ F_1 = \frac{2PE}{P + E}. \tag{16} $$




The $F_1$ measure considers purity and efficiency to be
equally important for the quality of a cluster. It is
evaluated for each cluster in the entire tree, and the best
clusters with respect to each category are taken as the
quality vector for that particular clustering experiment.
The quality of each tree is the mean of the best $F_1$-values
for each category. Tables III and IV show the values of
the quality vector for the different methods and experiments.
Each line in the two tables has been averaged
over 50 different realizations of the corresponding experiment,
i.e. by application to 50 different document
sets with the same category distribution.
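
The evaluation scheme can be illustrated with a short Python sketch. It is not the code used for the experiments; the function names, the representation of a tree as a plain list of clusters, and the convention that a cluster only counts towards the category that is its type (with 0 for a category that is never the type of any cluster) are assumptions made for this example.

```python
from collections import Counter


def purity_efficiency_f1(cluster, labels):
    """Return (type, purity, efficiency, F1) for one cluster.

    cluster: iterable of document ids
    labels:  dict mapping every document id to its category label
    """
    cluster = list(cluster)
    counts_in_cluster = Counter(labels[d] for d in cluster)
    category_sizes = Counter(labels.values())
    # Type T(C): the category with the most documents in the cluster.
    ctype, n_type = counts_in_cluster.most_common(1)[0]
    purity = n_type / len(cluster)                # P(C)
    efficiency = n_type / category_sizes[ctype]   # E(C)
    f1 = 2 * purity * efficiency / (purity + efficiency)
    return ctype, purity, efficiency, f1


def quality_vector(tree_clusters, labels):
    """Best F1 per category over all clusters of a tree, and their mean."""
    # Assumption: a category that is never the type of a cluster scores 0.
    best = {cat: 0.0 for cat in set(labels.values())}
    for cluster in tree_clusters:
        ctype, _, _, f1 = purity_efficiency_f1(cluster, labels)
        best[ctype] = max(best[ctype], f1)
    mean_quality = sum(best.values()) / len(best)
    return best, mean_quality
```

Calling `quality_vector` on all clusters of one tree gives the per-category entries and their mean for a single realization; averaging these over the realizations of an experiment would give one line of the tables.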

[Figure: bar chart; methods shown are PDDP, SPC, ARG, AIB, DSR and WSR.]

FIG. 2. Direct comparison of the clustering methods. The
bars indicate mean and standard deviation of the mean
$F_1$-values obtained by applying the different methods to the
800 documents experiments in the first (upper bars) and
second (lower bars) database.

As mentioned above, we here put aside the interesting
question of how to find the good clusters within the tree,
i.e. how to estimate in an unsupervised way which clusters
at which resolution are good approximations of the
underlying categories. Thus the values presented in
Tables III and IV are upper bounds for what is achievable
with any such search algorithm. We think that this problem
can be separated from the basic clustering problem, as the
occurrence of good clusters in the tree is the prerequisite
that limits the achievements of any algorithm that searches
the tree for good clusters.



[Figure: two graphs of the clustering performance versus $\log_2 N_R$, with the horizontal axis running from 0 to 8.]

FIG. 1. Dependence of the clustering performance of the
resampling methods WSR (applied to the 200 experiment,
upper graph) and DSR (applied to the 800 experiment, lower
graph) on the number of subsamples. Saturation occurs at
64 subsamples for the WSR method. DSR works best at
$N_R = 32$. The solid line corresponds to the first database,
the dashed line to the second.



Method   Time [s]
ARG      96 ± 9
DSR      136 ± 32
AIB      286 ± 54
SPC      801 ± 50
WSR      9678 ± 1192



TABLE V. Computational cost in seconds. All values
correspond to applying the algorithms to the 800 experiment
of the first database. The CPU was a PIII at
650 MHz. PDDP was run under MATLAB and thus cannot
be compared.



V. CONCLUSIONS 

We find that the quality of the clustering results depends
to a large extent on the dataset. In particular
we observe that the performance as measured in this paper
is almost always better on larger categories. When
comparing the results for single categories in the 800 and
EQ experiments, we find that they are better in the
experiment where the number of documents of that category
is larger. For example, in the second database the categories
"trade", "crude", "grain" and "money-fx" are larger in the
800 experiment, and their results are also better there.
All other categories are larger in the EQ experiment, and
their results are better there. In the same way, the
categories "ship" and "money-supply" can be better resolved
in the context of the first database.

We also think that the resolvability of a category is
influenced by interference with other categories in the
dataset through an overlap of the characteristic word
fields. We believe that if the characteristic words of a
category are also used in documents of other categories,
that category cannot be resolved as well as if there were
no close categories.

Comparing the results for the "money-supply", "ship"
and "sugar" categories in the EQ experiments of the
two databases gives a clue to possible interference. We
find that "money-supply" and "ship" are better resolved
in the EQ experiments on the first database, whereas
"sugar" is better in the second database.

Further we observe that some categories appear to
have a preferred algorithm or vice versa. For example, the
SPC performance in the EQ experiment of the second database
peaks in the categories "trade", "money-supply" and
"interest", whereas the results for the other categories are
only moderate.

However, the ranking of the performance of the different
clustering methods does not depend sensitively on
the data. We found that the level of performance of
SPC, ARG and AIB is almost the same. Results obtained
with PDDP are not as good, but the advantage
of this method is that it is much faster on large
databases. PDDP does not require the calculation of a
(dis-)similarity matrix and its time consumption scales
linearly with the number of documents.

The feature selection methods that we propose in this
paper can improve the results. WSR yields the highest
performance but has a very high computational cost.
As it is implemented, the required time scales with $n^3$.
DSR gives a moderate improvement of the clustering
quality but is much faster than WSR. The time
consumption of DSR is dominated by the size of the subsets.
Thus for large datasets, if one can probe the discriminative
words with comparably small subsets, it will be faster than
SPC, AIB and ARG, which all rely on the computation of a
complete distance matrix on the basis of the whole word set.
Another minor advantage of the feature selection methods is
that the application of a stoplist becomes obsolete: WSR and
DSR perform as well on the raw data matrix.